Abstract



This paper extends the framework of partially observable Markov decision processes (POMDPs) 
to multi-agent settings by incorporating the notion of agent models into the state space. Agents 
maintain beliefs over physical states of the environment and over models of other agents, and they 
use Bayesian updates to maintain their beliefs over time. The solutions map belief states to actions.
Models of other agents may include their belief states and are related to agent types considered in 
games of incomplete information. We express the agents' autonomy by postulating that their mod- 
els are not directly manipulable or observable by other agents. We show that important properties 
of POMDPs, such as convergence of value iteration, the rate of convergence, and piece-wise linear- 
ity and convexity of the value functions carry over to our framework. Our approach complements a 
more traditional approach to interactive settings which uses Nash equilibria as a solution paradigm. 
We seek to avoid some of the drawbacks of equilibria which may be non-unique and do not capture 
off-equilibrium behaviors. We do so at the cost of having to represent, process and continuously 
revise models of other agents. Since the agent's beliefs may be arbitrarily nested, the optimal
solutions to decision making problems are only asymptotically computable. However, approximate
belief updates and approximately optimal plans are computable. We illustrate our framework using
a simple application domain, and we show examples of belief updates and value functions.

1. Introduction 

We develop a framework for sequential rationality of autonomous agents interacting with other 
agents within a common, and possibly uncertain, environment. We use the normative paradigm of 
decision-theoretic planning under uncertainty formalized as partially observable Markov decision 
processes (POMDPs) (Boutilier, Dean, & Hanks, 1999; Kaelbling, Littman, & Cassandra, 1998;
Russell & Norvig, 2003) as a point of departure. Solutions of POMDPs are mappings from an 
agent's beliefs to actions. The drawback of POMDPs when it comes to environments populated by 
other agents is that other agents' actions have to be represented implicitly as environmental noise 
within the, usually static, transition model. Thus, an agent's beliefs about another agent are not part 
of solutions to POMDPs. 

The main idea behind our formalism, called interactive POMDPs (I-POMDPs), is to allow
agents to use more sophisticated constructs to model and predict behavior of other agents. Thus, 
we replace "flat" beliefs about the state space used in POMDPs with beliefs about the physical 
environment and about the other agent(s), possibly in terms of their preferences, capabilities, and 
beliefs. Such beliefs could include others' beliefs about others, and thus can be nested to arbitrary 
levels. They are called interactive beliefs. While the space of interactive beliefs is very rich and 
updating these beliefs is more complex than updating their "flat" counterparts, we use the value 
function plots to show that solutions to I-POMDPs are at least as good as, and in usual cases superior
to, comparable solutions to POMDPs. The reason is intuitive - maintaining sophisticated models of 
other agents allows more refined analysis of their behavior and better predictions of their actions. 

I-POMDPs are applicable to autonomous self-interested agents who locally compute what ac- 
tions they should execute to optimize their preferences given what they believe while interacting 
with others with possibly conflicting objectives. Our approach of using a decision-theoretic frame- 
work and solution concept complements the equilibrium approach to analyzing interactions as used 
in classical game theory (Fudenberg & Tirole, 1991). The drawback of equilibria is that there could 
be many of them (non-uniqueness), and that they describe agent's optimal actions only if, and when, 
an equilibrium has been reached (incompleteness). Our approach, instead, is centered on optimality 
and best response to anticipated action of other agent(s), rather than on stability (Binmore, 1990;
Kadane & Larkey, 1982). The question of whether, under what circumstances, and what kind of
equilibria could arise from solutions to I-POMDPs is currently open.

Our approach avoids the difficulties of non-uniqueness and incompleteness of the traditional
equilibrium approach, and offers solutions which are likely to be better than the solutions of traditional
POMDPs applied to multi-agent settings. But these advantages come at the cost of processing and 
maintaining possibly infinitely nested interactive beliefs. Consequently, only approximate belief 
updates and approximately optimal solutions to planning problems are computable in general. We 
define a class of finitely nested I-POMDPs to form a basis for computable approximations to in- 
finitely nested ones. We show that a number of properties that facilitate solutions of POMDPs carry 
over to finitely nested I-POMDPs. In particular, the interactive beliefs are sufficient statistics for the 
histories of agent's observations, the belief update is a generalization of the update in POMDPs, the 
value function is piece-wise linear and convex, and the value iteration algorithm converges at the 
same rate. 

The remainder of this paper is structured as follows. We start with a brief review of related 
work in Section 2, followed by an overview of partially observable Markov decision processes in 
Section 3. There, we include a simple example of a tiger game. We introduce the concept of 
agent types in Section 4. Section 5 introduces interactive POMDPs and defines their solutions. The 
finitely nested I-POMDPs, and some of their properties are introduced in Section 6. We continue 
with an example application of finitely nested I-POMDPs to a multi-agent version of the tiger game 
in Section 7. There, we show examples of belief updates and value functions. We conclude with 
a brief summary and some current research issues in Section 8. Details of all proofs are in the 
Appendix. 

2. Related Work 

Our work draws from prior research on partially observable Markov decision processes, which 
recently gained a lot of attention within the AI community (Smallwood & Sondik, 1973; Monahan, 
1982; Lovejoy, 1991; Hauskrecht, 1997; Kaelbling et al., 1998; Boutilier et al., 1999; Hauskrecht,
2000). 

The formalism of Markov decision processes has been extended to multiple agents giving rise to 
stochastic games or Markov games (Fudenberg & Tirole, 1991). Traditionally, the solution concept 
used for stochastic games is that of Nash equilibria. Some recent work in AI follows that tradition 
(Littman, 1994; Hu & Wellman, 1998; Boutilier, 1999; Koller & Milch, 2001). However, as we
mentioned before, and as has been pointed out by some game theorists (Binmore, 1990; Kadane & 
Larkey, 1982), while Nash equilibria are useful for describing a multi-agent system when, and if,
it has reached a stable state, this solution concept is not sufficient as a general control paradigm. 
The main reasons are that there may be multiple equilibria with no clear way to choose among them 
(non-uniqueness), and the fact that equilibria do not specify actions in cases in which agents believe 
that other agents may not act according to their equilibrium strategies (incompleteness). 

Other extensions of POMDPs to multiple agents appeared in AI literature recently (Bernstein, 
Givan, Immerman, & Zilberstein, 2002; Nair, Pynadath, Yokoo, Tambe, & Marsella, 2003). They 
have been called decentralized POMDPs (DEC-POMDPs), and are related to decentralized control 
problems (Ooi & Wornell, 1996). The DEC-POMDP framework assumes that the agents are fully
cooperative, i.e., they have a common reward function and form a team. Furthermore, it is assumed that
the optimal joint solution is computed centrally and then distributed among the agents for execution. 

From the game-theoretic side, we are motivated by the subjective approach to probability in
games (Kadane & Larkey, 1982), Bayesian games of incomplete information (see Fudenberg & 
Tirole, 1991; Harsanyi, 1967, and references therein), work on interactive belief systems (Harsanyi, 
1967; Mertens & Zamir, 1985; Brandenburger & Dekel, 1993; Fagin, Halpern, Moses, & Vardi, 
1995; Aumann, 1999; Fagin, Geanakoplos, Halpern, & Vardi, 1999), and insights from research on 
learning in game theory (Fudenberg & Levine, 1998). Our approach, closely related to decision- 
theoretic (Myerson, 1991), or epistemic (Ambruster & Boge, 1979; Battigalli & Siniscalchi, 1999; 
Brandenburger, 2002) approach to game theory, consists of predicting actions of other agents given 
all available information, and then of choosing the agent's own action (Kadane & Larkey, 1982). 
Thus, the descriptive aspect of decision theory is used to predict others' actions, and its prescriptive 
aspect is used to select agent's own optimal action. 

The work presented here also extends previous work on Recursive Modeling Method (RMM) 
(Gmytrasiewicz & Durfee, 2000), but adds elements of belief update and sequential planning. 

3. Background: Partially Observable Markov Decision Processes 

A partially observable Markov decision process (POMDP) (Monahan, 1982; Hauskrecht, 1997;
Kaelbling et al., 1998; Boutilier et al., 1999; Hauskrecht, 2000) of an agent i is defined as 

$$POMDP_i = \langle S, A_i, T_i, \Omega_i, O_i, R_i \rangle \qquad (1)$$

where: $S$ is a set of possible states of the environment. $A_i$ is a set of actions agent $i$ can execute. $T_i$ is
a transition function, $T_i : S \times A_i \times S \to [0, 1]$, which describes results of agent $i$'s actions. $\Omega_i$ is the
set of observations the agent $i$ can make. $O_i$ is the agent's observation function, $O_i : S \times A_i \times \Omega_i \to [0, 1]$,
which specifies probabilities of observations given agent's actions and resulting states. Finally,
$R_i$ is the reward function representing the agent $i$'s preferences, $R_i : S \times A_i \to \mathbb{R}$.

In POMDPs, an agent's belief about the state is represented as a probability distribution over $S$.
Initially, before any observations or actions take place, the agent has some (prior) belief, $b_i^0$. After
some time steps, $t$, we assume that the agent has $t + 1$ observations and has performed $t$ actions.¹
These can be assembled into agent $i$'s observation history: $h_i^t = \{o_i^0, o_i^1, \ldots, o_i^{t-1}, o_i^t\}$ at time $t$. Let
$H_i$ denote the set of all observation histories of agent $i$. The agent's current belief, $b_i^t$ over $S$, is
continuously revised based on new observations and expected results of performed actions. It turns
out that the agent's belief state is sufficient to summarize all of the past observation history and
initial belief; hence it is called a sufficient statistic.²

1. We assume that action is taken at every time step; it is without loss of generality since any of the actions may be a
No-op.

The belief update takes into account changes in initial belief, $b_i^{t-1}$, due to action, $a_i^{t-1}$, executed
at time $t - 1$, and the new observation, $o_i^t$. The new belief, $b_i^t$, that the current state is $s^t$, is:

$$b_i^t(s^t) = \beta \, O_i(s^t, a_i^{t-1}, o_i^t) \sum_{s^{t-1} \in S} b_i^{t-1}(s^{t-1}) \, T_i(s^{t-1}, a_i^{t-1}, s^t) \qquad (2)$$

where $\beta$ is the normalizing constant.

It is convenient to summarize the above update performed for all states in $S$ as
$b_i^t = SE_i(b_i^{t-1}, a_i^{t-1}, o_i^t)$ (Kaelbling et al., 1998).
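
The update in Eq. 2 can be sketched in a few lines of code. This is a minimal illustration rather than the authors' implementation: the function name `belief_update`, the nested-list layout of the transition and observation tables, and the example numbers (a two-state problem with a listening action that leaves the state unchanged and reports it correctly with probability 0.85) are all assumptions made for the example.

```python
def belief_update(b, T_a, O_a, o):
    """One belief update for a fixed action a (a sketch of Eq. 2).

    b   : prior belief, b[s_prev] = Pr(s_prev)
    T_a : T_a[s_prev][s_next] = Pr(s_next | s_prev, a)
    O_a : O_a[s_next][o]      = Pr(o | s_next, a)
    o   : index of the observation that was received
    """
    n = len(b)
    # Prediction step: sum over previous states of b(s_prev) * T(s_prev, a, s_next).
    predicted = [sum(b[sp] * T_a[sp][sn] for sp in range(n)) for sn in range(n)]
    # Correction step: weight each predicted state by the observation likelihood.
    unnormalized = [O_a[sn][o] * predicted[sn] for sn in range(n)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    # Dividing by the total plays the role of the normalizing constant beta.
    return [u / total for u in unnormalized]


# Illustrative two-state example (assumed numbers): the action does not move
# the state, and the observation matches the state with probability 0.85.
T_listen = [[1.0, 0.0],
            [0.0, 1.0]]
O_listen = [[0.85, 0.15],
            [0.15, 0.85]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, T_listen, O_listen, o=0)
```

Starting from a uniform prior, a single observation pointing to state 0 moves the belief in that state to 0.85, which is the observation accuracy assumed above.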

3.1 Optimality Criteria and Solutions 

The agent's optimality criterion, $OC_i$, is needed to specify how rewards acquired over time are
handled. Commonly used criteria include:

• A finite horizon criterion, in which the agent maximizes the expected value of the sum of the
following $T$ rewards: $E(\sum_{t=0}^{T} r_t)$. Here, $r_t$ is a reward obtained at time $t$ and $T$ is the length
of the horizon. We will denote this criterion as $fh^T$.

• An infinite horizon criterion with discounting, according to which the agent maximizes
$E(\sum_{t=0}^{\infty} \gamma^t r_t)$, where $0 < \gamma < 1$ is a discount factor. We will denote this criterion as $ih^\gamma$.

• An infinite horizon criterion with averaging, according to which the agent maximizes the
average reward per time step. We will denote this as $ih^{AV}$.

In what follows, we concentrate on the infinite horizon criterion with discounting, but our ap- 
proach can be easily adapted to the other criteria. 
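
To make the three criteria concrete, the sketch below evaluates each of them on a single, already sampled reward sequence. The criteria themselves are expectations over such sequences; the function names and the sample rewards are assumptions chosen for illustration, and the discounted and averaged returns are necessarily truncated to the finite sequence supplied.

```python
def finite_horizon_return(rewards, T):
    """Sum of the rewards r_0 .. r_T (one sample of the fh^T criterion)."""
    return sum(rewards[: T + 1])


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over the given sequence (truncated ih^gamma sample)."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("the discount factor must satisfy 0 < gamma < 1")
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def average_return(rewards):
    """Average reward per time step over the given sequence (ih^AV sample)."""
    return sum(rewards) / len(rewards)


# Hypothetical reward sequence: two costly steps followed by a payoff.
rewards = [-1.0, -1.0, 10.0]
fh = finite_horizon_return(rewards, T=2)
ih = discounted_return(rewards, gamma=0.9)
av = average_return(rewards)
```

Discounting shrinks the late payoff relative to the early costs, which is why the discounted return of this sequence is smaller than its plain sum.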

The utility associated with a belief state, $b_i$, is composed of the best of the immediate rewards
that can be obtained in $b_i$, together with the discounted expected sum of utilities associated with
belief states following $b_i$:

$$U(b_i) = \max_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R_i(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i | a_i, b_i) \, U(SE_i(b_i, a_i, o_i)) \Big\} \qquad (3)$$

Value iteration uses Equation 3 iteratively to obtain values of belief states for longer time
horizons. At each step of the value iteration the error of the current value estimate is reduced by a
factor of at least $\gamma$ (see for example Russell & Norvig, 2003, Section 17.2). The optimal action, $a_i^*$,
is then an element of the set of optimal actions, $OPT(b_i)$, for the belief state, defined as:



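$$OPT(b_i) = \arg\max_{a_i \in A_i} \Big\{ \sum_{s \in S} b_i(s) R_i(s, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i | a_i, b_i) \, U(SE_i(b_i, a_i, o_i)) \Big\} \qquad (4)$$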
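
The backup in Eq. 3 can be unrolled recursively for short horizons. The sketch below does this for the tiger game reviewed in Section 3.2, using the payoffs stated there (prize 10, tiger -100, listening -1) and a 15% hearing error. Several choices are assumptions of this example and not statements from the text: rewards are summed without discounting, opening a door resets the tiger to either side with probability 0.5 and yields an uninformative observation, and the function name `value` is hypothetical.

```python
# States: the belief is summarized by p_left = Pr(tiger behind the left door).
REWARD = {
    "L": (-1.0, -1.0),      # listening costs 1 wherever the tiger is
    "OL": (-100.0, 10.0),   # open left: tiger if it is left, prize if it is right
    "OR": (10.0, -100.0),   # open right: prize if tiger is left, tiger if it is right
}
ACCURACY = 0.85  # probability that the growl is heard from the correct side


def value(p_left, horizon):
    """Undiscounted finite-horizon value of a belief (recursive form of Eq. 3)."""
    if horizon == 0:
        return 0.0
    best = float("-inf")
    for action, (r_left, r_right) in REWARD.items():
        # Immediate expected reward: sum over states of b(s) * R(s, a).
        total = p_left * r_left + (1.0 - p_left) * r_right
        if action == "L":
            # Two observations: growl-left, then growl-right.
            for like_left in (ACCURACY, 1.0 - ACCURACY):
                like_right = 1.0 - like_left
                pr_obs = like_left * p_left + like_right * (1.0 - p_left)
                if pr_obs > 0.0:
                    posterior = like_left * p_left / pr_obs
                    total += pr_obs * value(posterior, horizon - 1)
        else:
            # Assumed reset: after a door opens the tiger is placed uniformly.
            total += value(0.5, horizon - 1)
        best = max(best, total)
    return best
```

Under these assumptions `value(0.5, 1)` is the cost of one listen and `value(1.0, 1)` is the prize; returning the maximizing action instead of the maximum would give the set in Eq. 4.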



2. See (Smallwood & Sondik, 1973) for proof. 
Figure 1: The value function for the single-agent tiger game with time horizon of length 1, $OC_i = fh^1$.
Actions are: open right door - OR, open left door - OL, and listen - L. For this value of
the time horizon the value function for a POMDP with noise factor is identical to that of the
single-agent POMDP. (Plotted curves: POMDP with noise; POMDP.)



3.2 Example: The Tiger Game 

We briefly review the POMDP solutions to the tiger game (Kaelbling et al., 1998). Our purpose is 
to build on the insights that POMDP solutions provide in this simple case to illustrate solutions to 
interactive versions of this game later. 

The traditional tiger game resembles a game-show situation in which the decision maker has 
to choose to open one of two doors behind which lies either a valuable prize or a dangerous tiger. 
Apart from actions that open doors, the subject has the option of listening for the tiger's growl 
coming from the left, or the right, door. However, the subject's hearing is imperfect, with given 
percentages (say, 15%) of false positive and false negative occurrences. Following (Kaelbling et al., 
1998), we assume that the value of the prize is 10, that the pain associated with encountering the 
tiger can be quantified as -100, and that the cost of listening is -1. 

The value function, in Figure 1, shows values of various belief states when the agent's time 
horizon is equal to 1. Values of beliefs are based on best action available in that belief state, as 
specified in Eq. 3. The state of certainty is most valuable - when the agent knows the location of 
the tiger it can open the opposite door and claim the prize which certainly awaits. Thus, when the 
probability of tiger location is 0 or 1, the value is 10. When the agent is sufficiently uncertain, its
best option is to play it safe and listen; the value is then -1. The agent is indifferent between opening 
doors and listening when it assigns probabilities of 0.9 or 0.1 to the location of the tiger. 
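
The indifference points quoted above follow directly from the payoffs. As a small worked check, the sketch below writes each action's horizon-1 value as a linear function of p, the probability that the tiger is behind the left door, and solves for the points where opening a door is worth exactly as much as listening. The function and variable names are illustrative assumptions.

```python
def q_open_right(p):
    """Open the right door: prize (10) if the tiger is left, tiger (-100) otherwise."""
    return 10.0 * p - 100.0 * (1.0 - p)


def q_open_left(p):
    """Open the left door: tiger (-100) if the tiger is left, prize (10) otherwise."""
    return -100.0 * p + 10.0 * (1.0 - p)


def q_listen(p):
    """Listening costs 1 regardless of the belief."""
    return -1.0


def horizon1_value(p):
    """Upper envelope of the three linear pieces."""
    return max(q_open_right(p), q_open_left(p), q_listen(p))


# Solving 110 p - 100 = -1 and 10 - 110 p = -1 for p:
p_high = 99.0 / 110.0   # indifferent between listening and opening the right door
p_low = 11.0 / 110.0    # indifferent between listening and opening the left door
```

The two thresholds evaluate to 0.9 and 0.1, and the envelope of three lines is the piece-wise linear and convex shape of the horizon-1 value function.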

Note that, when the time horizon is equal to 1, listening does not provide any useful information
since the game does not continue to allow for the use of this information. For longer time horizons
the benefits of listening result in policies which are better in some ranges of initial belief.
Since the value function is composed of values corresponding to actions, which are linear in the
probability of tiger location, the value function has the property of being piece-wise linear and convex
(PWLC) for all horizons. This simplifies the computations substantially.

Figure 2: The value function for the single-agent tiger game compared to an agent facing a noise
factor, for horizon of length 2. Policies corresponding to value lines are conditional plans.
Actions, L, OR or OL, are conditioned on observational sequences in parentheses. For
example L\();L\(GL),OL\(GR) denotes a plan to perform the listening action, L, at the
beginning (list of observations is empty), and then another L if the observation is growl
from the left (GL), and open the left door, OL, if the observation is GR. * is a wildcard
with the usual interpretation.

In Figure 2 we present a comparison of value functions for horizon of length 2 for a single 
agent, and for an agent facing a more noisy environment. The presence of such noise could be 
due to another agent opening the doors or listening with some probabilities.³ Since POMDPs do
not include explicit models of other agents, these noise actions have been included in the transition 
model, T. 

Consequences of folding noise into T are two-fold. First, the effectiveness of the agent's optimal 
policies declines since the value of hearing growls diminishes over many time steps. Figure 3 depicts 
a comparison of value functions for horizon of length 3. Here, for example, two consecutive growls 
in a noisy environment are not as valuable as when the agent knows it is acting alone since the noise 
may have perturbed the state of the system between the growls. For time horizon of length 1 the 
noise does not matter and the value vectors overlap, as in Figure 1. 
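
The loss of information caused by the noise can be checked numerically. The sketch below compares the belief after two consecutive growls from the left for an agent acting alone and for an agent whose transition model includes the noise of footnote 3 (each door opens with probability 0.1 per turn). It assumes, beyond what the text states, that an opened door re-places the tiger on either side with probability 0.5 and that the perturbation is applied before each listen; the function names are hypothetical.

```python
ACCURACY = 0.85       # probability of hearing the growl from the correct side
P_DOOR_OPENS = 0.2    # footnote 3: each of the two doors opens with probability 0.1


def listen_update(p_left, heard_left, p_reset):
    """Belief Pr(tiger-left) after one listen when the state resets w.p. p_reset."""
    # Prediction: with probability p_reset the tiger is re-placed uniformly (assumed).
    predicted = (1.0 - p_reset) * p_left + p_reset * 0.5
    # Likelihood of the received growl under each state.
    like_left = ACCURACY if heard_left else 1.0 - ACCURACY
    like_right = 1.0 - like_left
    numerator = like_left * predicted
    return numerator / (numerator + like_right * (1.0 - predicted))


def after_growls_from_left(n, p_reset):
    """Belief after n consecutive growls from the left, starting from 0.5."""
    p = 0.5
    for _ in range(n):
        p = listen_update(p, True, p_reset)
    return p


alone = after_growls_from_left(2, 0.0)
noisy = after_growls_from_left(2, P_DOOR_OPENS)
```

With these numbers the agent acting alone reaches a belief of about 0.97 after two growls, while the agent facing noise reaches about 0.95, so the same evidence supports less confident action.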

Second, since the presence of another agent is implicit in the static transition model, the agent 
cannot update its model of the other agent's actions during repeated interactions. This effect be- 
comes more important as time horizon increases. Our approach addresses this issue by allowing 
explicit modeling of the other agent(s). This results in policies of superior quality, as we show in 
Section 7. Figure 4 shows a policy for an agent facing a noisy environment for time horizon of 3. 
We compare it to the corresponding I-POMDP policy in Section 7. Note that it is slightly different
from the policy without noise in the example by Kaelbling, Littman and Cassandra (1998) due to
differences in value functions.

3. We assumed that, due to the noise, either door opens with probabilities of 0.1 at each turn, and nothing happens with
the probability 0.8. We explain the origin of this assumption in Section 7.

Figure 3: The value function for the single-agent tiger game compared to an agent facing a noise factor,
for horizon of length 3. The "?" in the description of a policy stands for any of the
perceptual sequences not yet listed in the description of the policy.

Figure 4: The policy graph corresponding to value function of POMDP with noise depicted in
Fig. 3. The initial-belief intervals labeling the graph are [0-0.045), [0.045-0.135), [0.135-0.175),
[0.175-0.825), [0.825-0.865), [0.865-0.955), [0.955-1].

4. Agent Types and Frames 

The POMDP definition includes parameters that permit us to compute an agent's optimal behavior,⁴
conditioned on its beliefs. Let us collect these implementation independent factors into a construct 
we call an agent i's type. 

Definition 1 (Type). A type of an agent $i$ is $\theta_i = \langle b_i, A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$, where $b_i$ is agent $i$'s
state of belief (an element of $\Delta(S)$), $OC_i$ is its optimality criterion, and the rest of the elements are
as defined before. Let $\Theta_i$ be the set of agent $i$'s types.

Given type, $\theta_i$, and the assumption that the agent is Bayesian-rational, the set of agent's optimal
actions will be denoted as $OPT(\theta_i)$. In the next section, we generalize the notion of type to
situations which include interactions with other agents; it then coincides with the notion of type used in
Bayesian games (Fudenberg & Tirole, 1991; Harsanyi, 1967). 

It is convenient to define the notion of a frame, $\hat{\theta}_i$, of agent $i$:

Definition 2 (Frame). A frame of an agent $i$ is $\hat{\theta}_i = \langle A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$. Let $\hat{\Theta}_i$ be the set of
agent $i$'s frames.

For brevity one can write a type as consisting of an agent's belief together with its frame:
$\theta_i = \langle b_i, \hat{\theta}_i \rangle$.

In the context of the tiger game described in the previous section, agent type describes the 
agent's actions and their results, the quality of the agent's hearing, its payoffs, and its belief about 
the tiger location. 
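
One way to picture Definitions 1 and 2 is as two small record types, with the type holding a belief plus a frame. The sketch below is an illustration under stated assumptions, not a construct from the paper: the class names `Frame` and `AgentType`, the string labels for states, actions and observations, and the uniform reset after a door opens are all hypothetical choices, while the payoffs and the 0.85 hearing accuracy follow Section 3.2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Frame:
    """Everything in a type except the belief (Definition 2)."""
    actions: Tuple[str, ...]
    observations: Tuple[str, ...]
    transition: Callable[[str, str, str], float]      # T_i(s, a, s_next)
    observation_fn: Callable[[str, str, str], float]  # O_i(s_next, a, o)
    reward: Callable[[str, str], float]               # R_i(s, a)
    optimality_criterion: str


@dataclass
class AgentType:
    """A belief over physical states together with a frame (Definition 1)."""
    belief: Dict[str, float]
    frame: Frame


def tiger_transition(s, a, s_next):
    # Listening leaves the tiger in place; opening a door resets it (assumed).
    if a == "L":
        return 1.0 if s == s_next else 0.0
    return 0.5


def tiger_observation(s_next, a, o):
    # Growls are informative only after listening.
    if a != "L":
        return 0.5
    correct = (s_next == "TL") == (o == "GL")
    return 0.85 if correct else 0.15


def tiger_reward(s, a):
    if a == "L":
        return -1.0
    tiger_door = "OL" if s == "TL" else "OR"
    return -100.0 if a == tiger_door else 10.0


tiger_frame = Frame(
    actions=("L", "OL", "OR"),
    observations=("GL", "GR"),
    transition=tiger_transition,
    observation_fn=tiger_observation,
    reward=tiger_reward,
    optimality_criterion="fh^1",
)
tiger_type = AgentType(belief={"TL": 0.5, "TR": 0.5}, frame=tiger_frame)
```

Keeping the frame separate from the belief mirrors the text: two agents with the same capabilities and payoffs but different beliefs share a frame and differ only in type.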

Realistically, apart from implementation-independent factors grouped in type, an agent's be- 
havior may also depend on implementation-specific parameters, like the processor speed, memory 
available, etc. These can be included in the (implementation dependent, or complete) type, increas- 
ing the accuracy of predicted behavior, but at the cost of additional complexity. Definition and use 
of complete types is a topic of ongoing work. 

5. Interactive POMDPs 

As we mentioned, our intention is to generalize POMDPs to handle presence of other agents. We 
do this by including descriptions of other agents (their types for example) in the state space. For 
simplicity of presentation, we consider an agent i, that is interacting with one other agent, j. The 
formalism easily generalizes to larger number of agents. 

Definition 3 (I-POMDP). An interactive POMDP of agent $i$, $I\text{-}POMDP_i$, is:

$$I\text{-}POMDP_i = \langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle \qquad (5)$$

4. The issue of computability of solutions to POMDPs has been a subject of much research (Papadimitriou & Tsitsiklis, 
1987; Madani, Hanks, & Condon, 2003). It is of obvious importance when one uses POMDPs to model agents; we 
return to this issue later. 
where:

• $IS_i$ is a set of interactive states defined as $IS_i = S \times M_j$,⁵ interacting with agent $i$, where
$S$ is the set of states of the physical environment, and $M_j$ is the set of possible models of agent
$j$. Each model, $m_j \in M_j$, is defined as a triple $m_j = \langle h_j, f_j, O_j \rangle$, where $f_j : H_j \to \Delta(A_j)$
is agent $j$'s function, assumed computable, which maps possible histories of $j$'s observations to
distributions over its actions, $h_j$ is an element of $H_j$, and $O_j$ is a function specifying the way the
environment is supplying the agent with its input. Sometimes we write model $m_j$ as $m_j = \langle h_j, \hat{m}_j \rangle$,
where $\hat{m}_j$ consists of $f_j$ and $O_j$. It is convenient to subdivide the set of models into two classes.
The subintentional models, $SM_j$, are relatively simple, while the intentional models, $IM_j$, use the
notion of rationality to model the other agent. Thus, $M_j = IM_j \cup SM_j$.

Simple examples of subintentional models include a no-information model and a fictitious play 
model, both of which are history independent. A no-information model (Gmytrasiewicz & Durfee, 
2000) assumes that each of the other agent's actions is executed with equal probability. Fictitious 
play (Fudenberg & Levine, 1998) assumes that the other agent chooses actions according to a fixed 
but unknown distribution, and that the original agent's prior belief over that distribution takes a form 
of a Dirichlet distribution.⁶ An example of a more powerful subintentional model is a finite state
controller. 

The intentional models are more sophisticated in that they ascribe to the other agent beliefs, 
preferences and rationality in action selection.⁷ Intentional models are thus $j$'s types, $\theta_j = \langle b_j, \hat{\theta}_j \rangle$,
under the assumption that agent $j$ is Bayesian-rational.⁸ Agent $j$'s belief is a probability distribution
over states of the environment and the models of the agent $i$; $b_j \in \Delta(S \times M_i)$. The notion of a type
we use here coincides with the notion of type in game theory, where it is defined as consisting of
all of the agent $i$'s private information relevant to its decision making (Harsanyi, 1967; Fudenberg
& Tirole, 1991). In particular, if agents' beliefs are private information, then their types involve
possibly infinitely nested beliefs over others' types and their beliefs about others (Mertens & Zamir,
1985; Brandenburger & Dekel, 1993; Aumann, 1999; Aumann & Heifetz, 2002).⁹ They are related
to recursive model structures in our prior work (Gmytrasiewicz & Durfee, 2000). The definition of 
interactive state space is consistent with the notion of a completely specified state space put forward 
by Aumann (1999). Similar state spaces have been proposed by others (Mertens & Zamir, 1985; 
Brandenburger & Dekel, 1993). 

• $A = A_i \times A_j$ is the set of joint moves of all agents.

• Ti is the transition model. The usual way to define the transition probabilities in POMDPs 
is to assume that the agent's actions can change any aspect of the state description. In case of I- 
POMDPs, this would mean actions modifying any aspect of the interactive states, including other 
agents' observation histories and their functions, or, if they are modeled intentionally, their beliefs 
and reward functions. Allowing agents to directly manipulate other agents in such ways, however, 
violates the notion of agents' autonomy. Thus, we make the following simplifying assumption: 

5. If there are more agents, say $N > 2$, then $IS_i = S \times_{j=1}^{N-1} M_j$.

6. Technically, according to our notation, fictitious play is actually an ensemble of models. 

7. Dennet (1986) advocates ascribing rationality to other agent(s), and calls it "assuming an intentional stance towards 
them". 

8. Note that the space of types is by far richer than that of computable models. In particular, since the set of computable
models is countable and the set of types is uncountable, many types are not computable models.

9. Implicit in the definition of interactive beliefs is the assumption of coherency (Brandenburger & Dekel, 1993). 
Model Non-manipulability Assumption (MNM): Agents' actions do not change the other
agents' models directly.

Given this simplification, the transition model can be defined as $T_i : S \times A \times S \to [0, 1]$.
Autonomy, formalized by the MNM assumption, precludes, for example, direct "mind control",
and implies that other agents' belief states can be changed only indirectly, typically by changing the
environment in a way observable to them. In other words, agents' beliefs change, like in POMDPs,
but as a result of belief update after an observation, not as a direct result of any of the agents'
actions.¹⁰

• $\Omega_i$ is defined as before in the POMDP model.

• $O_i$ is an observation function. In defining this function we make the following assumption:
Model Non-observability (MNO): Agents cannot observe other's models directly.

Given this assumption the observation function is defined as $O_i : S \times A \times \Omega_i \to [0, 1]$.

The MNO assumption formalizes another aspect of autonomy - agents are autonomous in that 
their observations and functions, or beliefs and other properties, say preferences, in intentional 
models, are private and the other agents cannot observe them directly. 

• $R_i$ is defined as $R_i : IS_i \times A \to \mathbb{R}$. We allow the agent to have preferences over physical
states and models of other agents, but usually only the physical state will matter. 
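
The two simple subintentional models mentioned in the description of $IS_i$ above can be sketched as follows. The no-information model returns a uniform distribution over the other agent's actions. For fictitious play, the sketch uses the standard Dirichlet posterior mean (pseudo-counts plus observed counts, normalized) as the predicted action distribution. The class and function names, the action labels, and the assumption that the modeling agent gets to see the other agent's past actions are all choices made for this example.

```python
ACTIONS_J = ("L", "OL", "OR")  # hypothetical action labels for agent j


def no_information_model(actions=ACTIONS_J):
    """Every action of the other agent is assigned equal probability."""
    return {a: 1.0 / len(actions) for a in actions}


class FictitiousPlay:
    """Fixed but unknown action distribution with a Dirichlet prior.

    The prediction is the posterior mean: each action's probability is its
    pseudo-count plus the number of times it was seen, divided by the total.
    """

    def __init__(self, actions, prior_count=1.0):
        # prior_count is the Dirichlet pseudo-count given to every action.
        self.counts = {a: prior_count for a in actions}

    def observe(self, action):
        """Record one observed action of the other agent (assumed observable)."""
        self.counts[action] += 1.0

    def predict(self):
        """Predicted distribution over the other agent's next action."""
        total = sum(self.counts.values())
        return {a: c / total for a, c in self.counts.items()}


uniform = no_information_model()
fp = FictitiousPlay(ACTIONS_J)
fp.observe("L")
fp.observe("L")
prediction = fp.predict()
```

After two observed listens and a pseudo-count of one per action, the predicted probability of listening is 3/5, while the no-information model stays at 1/3 regardless of what has been seen.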

As we mentioned, we see interactive POMDPs as a subjective counterpart to an objective
external view in stochastic games (Fudenberg & Tirole, 1991), which is also followed in some work in
AI (Boutilier, 1999; Koller & Milch, 2001) and in decentralized POMDPs (Bernstein et al.,
2002; Nair et al., 2003). Interactive POMDPs represent an individual agent's point of view on the
environment and the other agents, and facilitate planning and problem solving at the agent's own
individual level. 

5.1 Belief Update in I-POMDPs 

We will show that, as in POMDPs, an agent's beliefs over their interactive states are sufficient 
statistics, i.e., they fully summarize the agent's observation histories. Further, we need to show how 
beliefs are updated after the agent's action and observation, and how solutions are defined. 

The new belief state, $b_i^t$, is a function of the previous belief state, $b_i^{t-1}$, the last action, $a_i^{t-1}$,
and the new observation, $o_i^t$, just as in POMDPs. There are two differences that complicate belief
update when compared to POMDPs. First, since the state of the physical environment depends on 
the actions performed by both agents the prediction of how the physical state changes has to be 
made based on the probabilities of various actions of the other agent. The probabilities of other's 
actions are obtained based on their models. Thus, unlike in Bayesian and stochastic games, we do 
not assume that actions are fully observable by other agents. Rather, agents can attempt to infer what 
actions other agents have performed by sensing their results on the environment. Second, changes in 
the models of other agents have to be included in the update. These reflect the other's observations 
and, if they are modeled intentionally, the update of the other agent's beliefs. In this case, the agent 
has to update its beliefs about the other agent based on what it anticipates the other agent observes
and how it updates. As could be expected, the update of the possibly infinitely nested belief over
other's types is, in general, only asymptotically computable.

10. The possibility that agents can influence the observational capabilities of other agents can be accommodated by
including the factors that can change sensing capabilities in the set S.

11. Again, the possibility that agents can observe factors that may influence the observational capabilities of other agents
is allowed by including these factors in S.

Proposition 1. (Sufficiency) In an interactive POMDP of agent $i$, $i$'s current belief, i.e., the
probability distribution over the set $S \times M_j$, is a sufficient statistic for the past history of $i$'s observations.

The next proposition defines the agent $i$'s belief update function, $b_i^t(is^t) = Pr(is^t | o_i^t, a_i^{t-1}, b_i^{t-1})$,
where $is^t \in IS_i$ is an interactive state. We use the belief state estimation function, $SE_{\theta_i}$, as an
abbreviation for belief updates for individual states so that $b_i^t = SE_{\theta_i}(b_i^{t-1}, a_i^{t-1}, o_i^t)$.
$\tau_{\theta_i}(b_i^{t-1}, a_i^{t-1}, o_i^t, b_i^t)$ will stand for $Pr(b_i^t | b_i^{t-1}, a_i^{t-1}, o_i^t)$. Further below we also define the set of
type-dependent optimal actions of an agent, $OPT(\theta_i)$.



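Proposition 2. (Belief Update) Under the MNM and MNO assumptions, the belief update function
for an interactive POMDP $\langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle$, when $m_j$ in $is^t$ is intentional, is:

$$b_i^t(is^t) = \beta \sum_{is^{t-1} : \hat{m}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} | \theta_j^{t-1}) \, O_i(s^t, a^{t-1}, o_i^t)$$
$$\times \; T_i(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \, O_j(s^t, a^{t-1}, o_j^t) \qquad (6)$$
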
When $m_j$ in $is^t$ is subintentional the first summation extends over $is^{t-1} : \hat{m}_j^{t-1} = \hat{m}_j^t$,
$Pr(a_j^{t-1} | \theta_j^{t-1})$ is replaced with $Pr(a_j^{t-1} | m_j^{t-1})$, and $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ is replaced with the
Kronecker delta function $\delta_K(APPEND(h_j^{t-1}, o_j^t), h_j^t)$.

Above, $b_j^{t-1}$ and $b_j^t$ are the belief elements of $\theta_j^{t-1}$ and $\theta_j^t$, respectively, $\beta$ is a normalizing constant,
and $Pr(a_j^{t-1} | \theta_j^{t-1})$ is the probability that $a_j^{t-1}$ is Bayesian rational for agent described by type
$\theta_j^{t-1}$. This probability is equal to $1 / |OPT(\theta_j)|$ if $a_j^{t-1} \in OPT(\theta_j)$, and it is equal to zero otherwise.
We define OPT in Section 5.2.¹² For the case of $j$'s subintentional model, $is = (s, m_j)$, $h_j^{t-1}$ and
$h_j^t$ are the observation histories which are part of $m_j^{t-1}$ and $m_j^t$ respectively, $O_j$ is the observation
function in $m_j^t$, and $Pr(a_j^{t-1} | m_j^{t-1})$ is the probability assigned by $m_j^{t-1}$ to $a_j^{t-1}$. APPEND returns
a string with the second argument appended to the first. The proofs of the propositions are in the
Appendix.

Proposition 2 and Eq. 6 have a lot in common with belief update in POMDPs, as should be 
expected. Both depend on agent i's observation and transition functions. However, since agent i's 
observations also depend on agent j's actions, the probabilities of various actions of j have to be 
included (in the first line of Eq. 6.) Further, since the update of agent j's model depends on what 
j observes, the probabilities of various observations of j have to be included (in the second line of 
Eq. 6.) The update of $j$'s beliefs is represented by the $\tau_{\theta_j^t}$ term. The belief update can easily be
generalized to the setting where more than one other agent co-exists with agent $i$.
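
A much simplified instance of this update can be sketched for subintentional models. The sketch assumes two hypothetical, history-independent models of agent $j$ (one that always listens and a no-information model), so the model component of the interactive state never changes and the sum over $j$'s observations collapses to one. The two-agent tiger dynamics used here (any opened door resets the tiger uniformly, agent $i$ hears the correct growl with probability 0.85 when it listens) and all names are assumptions of the example, not the formulation of Section 7.

```python
STATES = ("TL", "TR")

# Hypothetical subintentional models of j: action distributions that ignore history.
MODELS = {
    "listener": {"L": 1.0, "OL": 0.0, "OR": 0.0},
    "no_info": {"L": 1.0 / 3.0, "OL": 1.0 / 3.0, "OR": 1.0 / 3.0},
}


def T_i(s, a_i, a_j, s_next):
    """Physical transition: the tiger stays put unless some door is opened."""
    if a_i == "L" and a_j == "L":
        return 1.0 if s == s_next else 0.0
    return 0.5  # assumed uniform reset after any door opens


def O_i(s_next, a_i, a_j, o_i):
    """Agent i's growl observation; informative only when i listens."""
    if a_i != "L":
        return 0.5
    correct = (s_next == "TL") == (o_i == "GL")
    return 0.85 if correct else 0.15


def interactive_belief_update(belief, a_i, o_i):
    """belief maps (physical state, model name) to a probability."""
    new_belief = {}
    for s_next in STATES:
        for m in MODELS:
            total = 0.0
            for s in STATES:
                for a_j, pr_a_j in MODELS[m].items():
                    # b(is) * Pr(a_j | m_j) * O_i(...) * T_i(...)
                    total += (belief[(s, m)] * pr_a_j
                              * O_i(s_next, a_i, a_j, o_i)
                              * T_i(s, a_i, a_j, s_next))
            new_belief[(s_next, m)] = total
    norm = sum(new_belief.values())  # 1 / norm plays the role of beta
    return {k: v / norm for k, v in new_belief.items()}


# Agent i already leans towards tiger-left and is undecided about j's model.
prior = {(s, m): p_s * 0.5
         for s, p_s in (("TL", 0.85), ("TR", 0.15))
         for m in MODELS}
posterior = interactive_belief_update(prior, a_i="L", o_i="GL")
p_listener = posterior[("TL", "listener")] + posterior[("TR", "listener")]
```

Hearing a second growl from the same side is more likely if $j$ has not been opening doors, so the posterior weight on the listener model rises above one half. This illustrates how an agent can infer the other's actions from their effect on the environment without observing them.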

12. If the agent's prior belief over $IS_i$ is given by a probability density function then the $\sum_{is^{t-1}}$ is replaced by
an integral. In that case $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ takes the form of Dirac delta function over argument $b_j^{t-1}$:
$\delta_D(SE_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t) - b_j^t)$.
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5.2 Value Function and Solutions in I-POMDPs 

Analogously to POMDPs, each belief state in I-POMDP has an associated value reflecting the maximum 
payoff the agent can expect in this belief state: 

$$
U(\theta_i) = \max_{a_i \in A_i} \Big\{ \sum_{is} ER_i(is, a_i)\, b_i(is) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\}
\tag{7}
$$

where $ER_i(is, a_i) = \sum_{a_j} R_i(is, a_i, a_j)\, Pr(a_j \mid m_j)$. Eq. 7 is a basis for value iteration in 
I-POMDPs. 
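The expected reward term $ER_i$ is a plain expectation over the other agent's actions. The sketch below shows the computation under stated assumptions: `expected_reward` and `toy_reward` are hypothetical names, the reward values are invented for illustration, and the action probabilities 0.1, 0.8, 0.1 are the ones quoted for agent j in Section 7.3.

```python
def expected_reward(reward, action_probs, interactive_state, a_i):
    """ER_i(is, a_i) = sum over a_j of R_i(is, a_i, a_j) * Pr(a_j | m_j)."""
    return sum(prob * reward(interactive_state, a_i, a_j)
               for a_j, prob in action_probs.items())


# Hypothetical reward that, unlike the tiger game, depends directly on a_j.
def toy_reward(interactive_state, a_i, a_j):
    return {"OL": -2.0, "L": 0.0, "OR": 4.0}[a_j]


# Pr(a_j | m_j): j opens each door with 0.1 and listens with 0.8.
probs_j = {"OL": 0.1, "L": 0.8, "OR": 0.1}
er = expected_reward(toy_reward, probs_j, ("TL", None), "L")
```

In the tiger game of Section 7 the reward does not depend on $a_j$ directly, so there $ER_i$ coincides with the single-agent reward.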

Agent i's optimal action, $a_i^*$, for the case of infinite horizon criterion with discounting, is an 
element of the set of optimal actions for the belief state, $OPT(\theta_i)$, defined as: 

$$
OPT(\theta_i) = \operatorname*{argmax}_{a_i \in A_i} \Big\{ \sum_{is} ER_i(is, a_i)\, b_i(is) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\}
\tag{8}
$$

As in the case of belief update, due to possibly infinitely nested beliefs, a step of value iteration 
and optimal actions are only asymptotically computable. 



6. Finitely Nested I-POMDPs 

Possible infinite nesting of agents' beliefs in intentional models presents an obvious obstacle to 
computing the belief updates and optimal solutions. Since the models of agents with infinitely 
nested beliefs correspond to agent functions which are not computable, it is natural to consider 
finite nestings. We follow approaches in game theory (Aumann, 1999; Brandenburger & Dekel, 
1993; Fagin et al., 1999), extend our previous work (Gmytrasiewicz & Durfee, 2000), and construct 
finitely nested I-POMDPs bottom-up. Assume a set of physical states of the world S, and two 
agents i and j. Agent i's 0-th level beliefs, $b_{i,0}$, are probability distributions over S. Its 0-th level 
types, $\Theta_{i,0}$, contain its 0-th level beliefs, and its frames, and analogously for agent j. 0-level types 
are, therefore, POMDPs.^{13} 0-level models include 0-level types (i.e., intentional models) and the 
subintentional models, elements of SM. An agent's first level beliefs are probability distributions 
over physical states and 0-level models of the other agent. An agent's first level types consist of 
its first level beliefs and frames. Its first level models consist of the types up to level 1 and the 
subintentional models. Second level beliefs are defined in terms of first level models and so on. 
Formally, define spaces: 

$$
\begin{array}{lll}
IS_{i,0} = S, & \Theta_{j,0} = \{\langle b_{j,0}, \hat{\theta}_j \rangle : b_{j,0} \in \Delta(IS_{j,0})\}, & M_{j,0} = \Theta_{j,0} \cup SM_j \\
IS_{i,1} = S \times M_{j,0}, & \Theta_{j,1} = \{\langle b_{j,1}, \hat{\theta}_j \rangle : b_{j,1} \in \Delta(IS_{j,1})\}, & M_{j,1} = \Theta_{j,1} \cup M_{j,0} \\
IS_{i,l} = S \times M_{j,l-1}, & \Theta_{j,l} = \{\langle b_{j,l}, \hat{\theta}_j \rangle : b_{j,l} \in \Delta(IS_{j,l})\}, & M_{j,l} = \Theta_{j,l} \cup M_{j,l-1}
\end{array}
$$

Definition 4. (Finitely Nested I-POMDP) A finitely nested I-POMDP of agent i, $I\text{-}POMDP_{i,l}$, is: 

$$
I\text{-}POMDP_{i,l} = \langle IS_{i,l}, A, T_i, \Omega_i, O_i, R_i \rangle
\tag{9}
$$

The parameter l will be called the strategy level of the finitely nested I-POMDP. The belief update, 
value function, and the optimal actions for finitely nested I-POMDPs are computed using Equation 6 
and Equation 8, but recursion is guaranteed to terminate at 0-th level and subintentional models. 

13. In 0-level types the other agent's actions are folded into the T, O and R functions as noise. 

Agents which are more strategic are capable of modeling others at deeper levels (i.e., all levels 
up to their own strategy level l), but are always only boundedly optimal. As such, these agents 
could fail to predict the strategy of a more sophisticated opponent. The fact that the computability 
of an agent function implies that the agent may be suboptimal during interactions has been pointed 
out by Binmore (1990), and proved more recently by Nachbar and Zame (1996). Intuitively, the 
difficulty is that an agent's unbounded optimality would have to include the capability to model the 
other agent's modeling the original agent. This leads to an impossibility result due to self-reference, 
which is very similar to Gödel's incompleteness theorem and the halting problem (Brandenburger, 
2002). On a positive note, some convergence results (Kalai & Lehrer, 1993) strongly suggest that 
approximate optimality is achievable, although their applicability to our work remains open. 

As we mentioned, the 0-th level types are POMDPs. They provide probability distributions 
over actions of the agent modeled at that level to models with strategy level of 1. Given probability 
distributions over the other agent's actions, the level-1 models can themselves be solved as POMDPs, 
and provide probability distributions to yet higher level models. Assume that the number of models 
considered at each level is bounded by a number, M. Solving an $I\text{-}POMDP_{i,l}$ is then equivalent to 
solving $O(M^l)$ POMDPs. Hence, the complexity of solving an $I\text{-}POMDP_{i,l}$ is PSPACE-hard for 
finite time horizons,^{14} and undecidable for infinite horizons, just like for POMDPs. 
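The $O(M^l)$ count follows from the bottom-up recursion described above. The short sketch below counts POMDP solves for one model at a given strategy level, assuming (as a simplification made here) that every model considers exactly M models one level down, that no solutions are shared between models, and that subintentional models cost nothing to evaluate. The function name is hypothetical.

```python
def pomdps_to_solve(level, models_per_level):
    """Count the POMDP solves needed for one model at strategy level `level`.

    A model at level k > 0 is solved as a POMDP after each of the
    `models_per_level` lower-level models it considers has been solved.
    Level-0 models are plain POMDPs, so the recursion terminates there.
    """
    if level < 0 or models_per_level < 1:
        raise ValueError("level must be >= 0 and models_per_level >= 1")
    if level == 0:
        return 1
    return 1 + models_per_level * pomdps_to_solve(level - 1, models_per_level)
```

The count is the geometric sum $1 + M + \dots + M^l$, which is dominated by its last term for $M > 1$ and is therefore $O(M^l)$.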

6.1 Some Properties of I-POMDPs 

In this section we establish two important properties, namely convergence of value iteration and 
piecewise linearity and convexity of the value function, for finitely nested I-POMDPs. 

6.1.1 Convergence of Value Iteration 

For an agent i and its $I\text{-}POMDP_{i,l}$, we can show that the sequence of value functions, $\{U^n\}$, where 
n is the horizon, obtained by value iteration defined in Eq. 7, converges to a unique fixed point, $U^*$. 

Let us define a backup operator $H : B \to B$ such that $U^n = HU^{n-1}$, and B is the set of all 
bounded value functions. In order to prove the convergence result, we first establish some of the 
properties of H. 

Lemma 1 (Isotonicity). For any finitely nested I-POMDP value functions V and U, if $V \le U$, then 
$HV \le HU$. 

The proof of this lemma is analogous to one due to Hauskrecht (1997), for POMDPs. It is 
also sketched in the Appendix. Another important property exhibited by the backup operator is the 
property of contraction. 

Lemma 2 (Contraction). For any finitely nested I-POMDP value functions V, U and a discount 
factor $\gamma \in (0,1)$, $\|HV - HU\| \le \gamma \|V - U\|$. 

The proof of this lemma is again similar to the corresponding one in POMDPs (Hauskrecht, 
1997). The proof makes use of Lemma 1. $\|\cdot\|$ is the supremum norm. 



14. Usually PSPACE-complete since the number of states in I-POMDPs is likely to be larger than the time horizon 
(Papadimitriou & Tsitsiklis, 1987). 

Under the contraction property of H, and noting that the space of value functions along with 
the supremum norm forms a complete normed space (Banach space), we can apply the Contraction 
Mapping Theorem (Stokey & Lucas, 1989) to show that value iteration for I-POMDPs converges 
to a unique fixed point (optimal solution). The following theorem captures this result. 

Theorem 1 (Convergence). For any finitely nested I-POMDP, the value iteration algorithm starting 
from any arbitrary well-defined value function converges to a unique fixed point. 

The detailed proof of this theorem is included in the Appendix. 
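The contraction argument can be checked numerically. The sketch below is not an I-POMDP solver: as a stand-in it uses the Bellman backup of a small, randomly generated finite MDP, which has the same max-plus-discounted-expectation structure as H. All names (`backup`, `sup_norm`, `random_mdp`) and the problem sizes are assumptions chosen for illustration.

```python
import random


def backup(U, R, T, gamma):
    """One Bellman backup for a finite MDP (stand-in for the operator H)."""
    n = len(U)
    return [max(R[s][a] + gamma * sum(T[s][a][s2] * U[s2] for s2 in range(n))
                for a in range(len(R[s])))
            for s in range(n)]


def sup_norm(V, U):
    """Supremum norm of V - U over a finite state set."""
    return max(abs(v - u) for v, u in zip(V, U))


def random_mdp(n_states, n_actions, rng):
    """Random rewards in [-1, 1] and row-normalised transition probabilities."""
    R = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_states)]
    T = []
    for _ in range(n_states):
        rows = []
        for _ in range(n_actions):
            w = [rng.random() for _ in range(n_states)]
            total = sum(w)
            rows.append([x / total for x in w])
        T.append(rows)
    return R, T


rng = random.Random(0)
gamma = 0.9
R, T = random_mdp(4, 3, rng)
V = [rng.uniform(-10, 10) for _ in range(4)]
U = [rng.uniform(-10, 10) for _ in range(4)]

# Lemma 2: one backup shrinks the sup-norm distance by at least a factor gamma.
contracts = (sup_norm(backup(V, R, T, gamma), backup(U, R, T, gamma))
             <= gamma * sup_norm(V, U) + 1e-12)

# Theorem 1: iterating from two different starting points reaches the same fixed point.
for _ in range(500):
    V, U = backup(V, R, T, gamma), backup(U, R, T, gamma)
gap = sup_norm(V, U)
```

Because each backup shrinks the distance by $\gamma$, the gap between the two sequences after n backups is at most $\gamma^n$ times the initial gap, which is why both converge to the same fixed point.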

As in the case of POMDPs (Russell & Norvig, 2003), the error in the iterative estimates, $U^n$, for 
finitely nested I-POMDPs, i.e., $\|U^n - U^*\|$, is reduced by the factor of at least $\gamma$ on each iteration. 
Hence, the number of iterations, N, needed to reach an error of at most $\epsilon$ is: 

$$
N = \lceil \log(R_{max} / \epsilon(1 - \gamma)) / \log(1/\gamma) \rceil
\tag{10}
$$

where $R_{max}$ is the upper bound of the reward function. 
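Eq. 10 is easy to evaluate directly. The helper below is a small sketch with a hypothetical name and hypothetical example values; it simply transcribes the formula and clamps the result at zero for the case where the initial error bound is already below the tolerance.

```python
import math


def iterations_needed(r_max, epsilon, gamma):
    """Number of value-iteration steps from Eq. 10 for error at most epsilon."""
    if not 0.0 < gamma < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 1")
    if r_max <= 0.0 or epsilon <= 0.0:
        raise ValueError("r_max and epsilon must be positive")
    n = math.ceil(math.log(r_max / (epsilon * (1.0 - gamma))) / math.log(1.0 / gamma))
    return max(n, 0)


# Hypothetical values: rewards bounded by 10, tolerance 0.01, discount 0.9.
n_steps = iterations_needed(10.0, 0.01, 0.9)  # 88
```

With these values the initial error bound $R_{max}/(1-\gamma) = 100$ has to shrink by a factor of $10^4$, which takes 88 multiplications by 0.9.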

6.1.2 Piecewise Linearity and Convexity 

Another property that carries over from POMDPs to finitely nested I-POMDPs is the piecewise 
linearity and convexity (PWLC) of the value function. Establishing this property allows us to decompose 
the I-POMDP value function into a set of alpha vectors, each of which represents a policy 
tree. The PWLC property enables us to work with sets of alpha vectors rather than perform value 
iteration over the continuum of agent's beliefs. Theorem 2 below states the PWLC property of the 
I-POMDP value function. 

Theorem 2 (PWLC). For any finitely nested I-POMDP, U is piecewise linear and convex. 

The complete proof of Theorem 2 is included in the Appendix. The proof is similar to one 
due to Smallwood and Sondik (1973) for POMDPs and proceeds by induction. The basis case is 
established by considering the horizon 1 value function. Showing the PWLC for the inductive step 
requires substituting the belief update (Eq. 6) into Eq. 7, followed by factoring out the belief from 
both terms of the equation. 
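The practical meaning of PWLC is that the value at a belief is the maximum of finitely many linear functions of that belief. The sketch below illustrates this for a flat belief over two states. The alpha vectors are hypothetical horizon-1 values in the style of the single-agent tiger problem (listen, open left, open right) and are not taken from this section.

```python
def value(belief, alpha_vectors):
    """U(b) = max over alpha vectors of the inner product <alpha, b>."""
    return max(sum(a * b for a, b in zip(alpha, belief))
               for alpha in alpha_vectors)


# Hypothetical alpha vectors over the states (TL, TR).
alphas = [(-1.0, -1.0),     # listen
          (-100.0, 10.0),   # open left
          (10.0, -100.0)]   # open right


def is_convex_between(b1, b2, steps=20):
    """Check U(lam*b1 + (1-lam)*b2) <= lam*U(b1) + (1-lam)*U(b2) on a grid."""
    for k in range(steps + 1):
        lam = k / steps
        mix = tuple(lam * x + (1.0 - lam) * y for x, y in zip(b1, b2))
        bound = lam * value(b1, alphas) + (1.0 - lam) * value(b2, alphas)
        if value(mix, alphas) > bound + 1e-9:
            return False
    return True
```

A maximum of linear functions is always convex, so the check succeeds for any pair of beliefs; at the uniform belief the listening vector attains the maximum.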

7. Example: Multi-agent Tiger Game 

To illustrate optimal sequential behavior of agents in multi-agent settings we apply our I-POMDP 
framework to the multi-agent tiger game, a traditional version of which we described before. 

7.1 Definition 

Let us denote the actions of opening doors and listening as OR, OL and L, as before. TL and 
TR denote states corresponding to tiger located behind the left and right door, respectively. The 
transition, reward and observation functions depend now on the actions of both agents. Again, we 
assume that the tiger location is chosen randomly in the next time step if any of the agents opened 
any doors in the current step. We also assume that the agent hears the tiger's growls, GR and GL, 
with the accuracy of 85%. To make the interaction more interesting we added an observation of 
door creaks, which depend on the action executed by the other agent. Creak right, CR, is likely due 
to the other agent having opened the right door, and similarly for creak left, CL. Silence, S, is a good 
indication that the other agent did not open doors and listened instead. We assume that the accuracy 
of creaks is 90%. We also assume that the agent's payoffs are analogous to the single agent versions 
described in Section 3.2 to make these cases comparable. Note that the result of this assumption is 
that the other agent's actions do not impact the original agent's payoffs directly, but rather indirectly 
by resulting in states that matter to the original agent. Table 1 quantifies these factors. 

Observation functions of agents i and j. 



Table 1: Transition, reward, and observation functions for the multi-agent Tiger game. 

When an agent makes its choice in the multi-agent tiger game, it considers what it believes 
about the location of the tiger, as well as whether the other agent will listen or open a door, which in 
turn depends on the other agent's beliefs, reward function, optimality criterion, etc.^{15} In particular, 
if the other agent were to open any of the doors the tiger location in the next time step would be 
chosen randomly. Thus, the information obtained from hearing the previous growls would have to 
be discarded. We simplify the situation by considering i's I-POMDP with a single level of nesting, 
assuming that all of the agent j's properties, except for beliefs, are known to i, and that j's time 
horizon is equal to i's. In other words, i's uncertainty pertains only to j's beliefs and not to its 
frame. Agent i's interactive state space is $IS_{i,1} = S \times \Theta_{j,0}$, where S is the physical state, S = {TL, 
TR}, and $\Theta_{j,0}$ is a set of intentional models of agent j, each of which differs only in j's beliefs 
over the location of the tiger. 

15. We assume an intentional model of the other agent here. 

7.2 Examples of the Belief Update 

In Section 5, we presented the belief update equation for I-POMDPs (Eq. 6). Here we consider 
examples of beliefs, $b_{i,1}$, of agent i, which are probability distributions over $S \times \Theta_{j,0}$. Each 0-th 
level type of agent j, $\theta_{j,0} \in \Theta_{j,0}$, contains a "flat" belief as to the location of the tiger, which can be 
represented by a single probability assignment, $b_{j,0} = p_j(TL)$. 

Figure 5: Two examples of singly nested belief states of agent i. In each case i has no information 
about the tiger's location. In (i) agent i knows that j does not know the location of the 
tiger; the single point (star) denotes a Dirac delta function which integrates to the height 
of the point, here 0.5. In (ii) agent i is uninformed about j's beliefs about tiger's location. 



In Fig. 5 we show some examples of level 1 beliefs of agent i. In each case i does not know 
the location of the tiger so that the marginals in the top and bottom sections of the figure sum up to 
0.5 for probabilities of TL and TR each. In Fig. 5(i), i knows that j assigns 0.5 probability to tiger 
being behind the left door. This is represented using a Dirac delta function. In Fig. 5(ii), agent i is 
uninformed about j's beliefs. This is represented as a uniform probability density over all values of 
the probability j could assign to state TL. 

To make the presentation of the belief update more transparent we decompose the formula in 
Eq. 6 into two steps: 

Prediction: When agent i performs an action $a_i^{t-1}$, and given that agent j performs $a_j^{t-1}$, the 
predicted belief state is: 

$$
\begin{aligned}
Pr(is^t \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1}) = {} & \sum_{is^{t-1}:\, \hat{m}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1})\, Pr(a_j^{t-1} \mid \theta_j^{t-1})\, T_i(s^{t-1}, a^{t-1}, s^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t) \\
& \times \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)
\end{aligned}
\tag{11}
$$

Correction: When agent i perceives an observation, $o_i^t$, the predicted belief states, 
$Pr(\cdot \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1})$, are combined according to: 

$$
b_i^t(is^t) = Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1}) = \beta \sum_{a_j^{t-1}} O_i(s^t, a^{t-1}, o_i^t)\, Pr(is^t \mid a_i^{t-1}, a_j^{t-1}, b_i^{t-1})
\tag{12}
$$

where $\beta$ is the normalizing constant. 

Figure 6: A trace of the belief update of agent i. (a) depicts the prior. (b) is the result of prediction 
given i's listening action, L, and a pair denoting j's action and observation. i knows that 
j will listen and could hear tiger's growl on the right or the left, and that the probabilities 
j would assign to TL are 0.15 or 0.85, respectively. (c) is the result of correction after 
i observes tiger's growl on the left and no creaks, (GL,S). The probability i assigns to 
TL is now greater than TR. (d) depicts the results of another update (both prediction and 
correction) after another listen action of i and the same observation, (GL,S). 

Each discrete point above denotes, again, a Dirac delta function which integrates to the height of 
the point. 

In Fig. 6, we display the example trace through the update of singly nested belief. In the first 
column of Fig. 6, labeled (a), is an example of agent i's prior belief we introduced before, according 
to which i knows that j is uninformed of the location of the tiger. Let us assume that i listens and 
hears a growl from the left and no creaks. The second column of Fig. 6, (b), displays the predicted 
belief after i performs the listen action (Eq. 11). As part of the prediction step, agent i must solve 
j's model to obtain j's optimal action when its belief is 0.5 (term $Pr(a_j^{t-1} \mid \theta_j)$ in Eq. 11). Given the 
value function in Fig. 3, this evaluates to probability of 1 for listen action, and zero for opening of 
any of the doors. i also updates j's belief given that j listens and hears the tiger growling from either 
the left, GL, or right, GR (term $\tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ in Eq. 11). Agent j's updated probabilities 
for tiger being on the left are 0.85 and 0.15, for j's hearing GL and GR, respectively. If the tiger is 
on the left (top of Fig. 6 (b)) j's observation GL is more likely, and consequently j's assigning the 
probability of 0.85 to state TL is more likely (i assigns a probability of 0.425 to this state). When 
the tiger is on the right j is more likely to hear GR and i assigns the lower probability, 0.075, to 
j's assigning a probability 0.85 to tiger being on the left. The third column, (c), of Fig. 6 shows 
the posterior belief after the correction step. The belief in column (b) is updated to account for i's 
hearing a growl from the left and no creaks, (GL,S). The resulting marginalised probability of the 
tiger being on the left is higher (0.85) than that of the tiger being on the right. If we assume that in 
the next time step i again listens and hears the tiger growling from the left and no creaks, the belief 
state depicted in the fourth column of Fig. 6 results. 
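The numbers quoted for the first update in Fig. 6 can be reproduced with a short script. The sketch below hard-codes the situation described in the text: both agents listen, so the tiger stays where it is, and j listens with probability 1 from a belief of 0.5. The function names, the dictionary representation of i's belief, and the reading of the 90% creak accuracy as the probability of silence when j listens are assumptions made here (Table 1 is not reproduced in this text). The silence factor is constant and cancels on normalisation.

```python
GROWL_ACCURACY = 0.85        # growl accuracy from Section 7.1
SILENCE_GIVEN_LISTEN = 0.90  # assumed reading of the 90% creak accuracy


def growl_likelihood(growl, state):
    """Probability of hearing `growl` ("GL" or "GR") when the tiger is in `state`."""
    matches = (growl == "GL") == (state == "TL")
    return GROWL_ACCURACY if matches else 1.0 - GROWL_ACCURACY


def level0_update(p_tl, growl):
    """Agent j's POMDP update of Pr(TL) after listening and hearing `growl`."""
    left = growl_likelihood(growl, "TL") * p_tl
    right = growl_likelihood(growl, "TR") * (1.0 - p_tl)
    return left / (left + right)


def predict(belief):
    """Prediction step (Eq. 11) when both agents listen and the state persists."""
    predicted = {}
    for (state, p_j), mass in belief.items():
        for o_j in ("GL", "GR"):
            key = (state, round(level0_update(p_j, o_j), 6))
            predicted[key] = predicted.get(key, 0.0) + mass * growl_likelihood(o_j, state)
    return predicted


def correct(predicted, growl):
    """Correction step (Eq. 12) for i's observation (growl, silence)."""
    weighted = {key: mass * growl_likelihood(growl, key[0]) * SILENCE_GIVEN_LISTEN
                for key, mass in predicted.items()}
    total = sum(weighted.values())
    return {key: mass / total for key, mass in weighted.items()}


prior = {("TL", 0.5): 0.5, ("TR", 0.5): 0.5}   # Fig. 6(a)
predicted = predict(prior)                      # Fig. 6(b)
posterior = correct(predicted, "GL")            # Fig. 6(c)
p_tl = sum(mass for (state, _), mass in posterior.items() if state == "TL")
```

Under these assumptions the predicted belief assigns 0.425 to (TL, $p_j = 0.85$) and 0.075 to (TR, $p_j = 0.85$), and the marginal probability of TL after correction is 0.85, matching the values in the text.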

In Fig. 7 we show the belief update starting from the prior in Fig. 5 (ii), according to which 
agent i initially has no information about what j believes about the tiger's location. 

The traces of belief updates in Fig. 6 and Fig. 7 illustrate the changing state of information agent 
i has about the other agent's beliefs. The benefit of representing these updates explicitly is that, at 
each stage, i's optimal behavior depends on its estimate of probabilities of j's actions. The more 
informative these estimates are the more value agent i can expect out of the interaction. Below, we 
show the increase in the value function for I-POMDPs compared to POMDPs with the noise factor. 

7.3 Examples of Value Functions 

This section compares value functions obtained from solving a POMDP with a static noise factor, 
accounting for the presence of another agent, to value functions of level-1 I-POMDP. The advantage 
of more refined modeling and update in I-POMDPs is due to two factors. First is the ability to 
keep track of the other agent's state of beliefs to better predict its future actions. The second is the 
ability to adjust the other agent's time horizon as the number of steps to go during the interaction 
decreases. Neither of these is possible within the classical POMDP formalism. 

We continue with the simple example of $I\text{-}POMDP_{i,1}$ of agent i. In Fig. 8 we display i's 
value function for the time horizon of 1, assuming that i's initial belief as to the value j assigns 
to TL, $p_j(TL)$, is as depicted in Fig. 5(ii), i.e. i has no information about what j believes about 
tiger's location. This value function is identical to the value function obtained for an agent using 
a traditional POMDP framework with noise, as well as single agent POMDP which we described 
in Section 3.2. The value functions overlap since agents do not have to update their beliefs and 
the advantage of more refined modeling of agent j in i's I-POMDP does not become apparent. Put 
another way, when agent i models j using an intentional model, it concludes that agent j will open 
each door with probability 0.1 and listen with probability 0.8. This coincides with the noise factor 
we described in Section 3.2. 

16. The points in Fig. 7 again denote Dirac delta functions which integrate to the value equal to the points' height. 

17. The POMDP with noise is the same as level-0 I-POMDP. 

Figure 7: A trace of the belief update of agent i. (a) depicts the prior according to which i is 
uninformed about j's beliefs. (b) is the result of the prediction step after i's listening 
action (L). The top half of (b) shows i's belief after it has listened and given that j also 
listened. The two observations j can make, GL and GR, each with probability dependent 
on the tiger's location, give rise to flat portions representing what i knows about j's belief 
in each case. The increased probability i assigns to j's belief between 0.472 and 0.528 is 
due to j's updates after it hears GL and after it hears GR resulting in the same values in 
this interval. The bottom half of (b) shows i's belief after i has listened and j has opened 
the left or right door (plots are identical for each action and only one of them is shown). i 
knows that j has no information about the tiger's location in this case. (c) is the result of 
correction after i observes tiger's growl on the left and no creaks (GL,S). The plots in (c) 
are obtained by performing a weighted summation of the plots in (b). The probability i 
assigns to TL is now greater than TR, and information about j's beliefs allows i to refine 
its prediction of j's action in the next time step. 

Figure 8: For time horizon of 1 the value functions obtained from solving a singly nested I-POMDP 
and a POMDP with noise factor overlap. 



[Figure 9 plot: value functions over $p_i(TL)$ annotated with policy labels; legend: Level 1 I-POMDP, POMDP with noise.] 

Figure 9: Comparison of value functions obtained from solving an I-POMDP and a POMDP with 
noise for time horizon of 2. I-POMDP value function dominates due to agent i adjusting 
the behavior of agent j to the remaining steps to go in the interaction. 

Figure 10: Comparison of value functions obtained from solving an I-POMDP and a POMDP with 
noise for time horizon of 3. I-POMDP value function dominates due to agent i's adjusting 
j's remaining steps to go, and due to i's modeling j's belief update. Both factors 
allow for better predictions of j's actions during interaction. The descriptions of individual 
policies were omitted for clarity; they can be read off of Fig. 11. 



In Fig. 9 we display i's value functions for the time horizon of 2. The value function of 
$I\text{-}POMDP_{i,1}$ is higher than the value function of a POMDP with a noise factor. The reason is 
not related to the advantages of modeling agent j's beliefs; this effect becomes apparent at the time 
horizon of 3 and longer. Rather, the I-POMDP solution dominates due to agent i modeling j's time 
horizon during interaction: i knows that at the last time step j will behave according to its optimal 
policy for time horizon of 1, while with two steps to go j will optimize according to its 2 steps to go 
policy. As we mentioned, this effect cannot be modeled using a POMDP with a static noise factor 
included in the transition function. 

Fig. 10 shows a comparison between the I-POMDP and the noisy POMDP value functions for 
horizon 3. The advantage of more refined agent modeling within the I-POMDP framework has 
increased.^{18} Both factors, i's adjusting j's steps to go and i's modeling j's belief update during 
interaction, are responsible for the superiority of values achieved using the I-POMDP. In particular, 
recall that at the second time step i's information as to j's beliefs about the tiger's location is as 
depicted in Fig. 7 (c). This enables i to make a high quality prediction that, with two steps left to 
go, j will perform its actions OL, L, and OR with probabilities 0.009076, 0.96591 and 0.02501, 
respectively (recall that for POMDP with noise these probabilities remained unchanged at 0.1, 0.8, 
and 0.1, respectively). 

Fig. 11 shows agent i's policy graph for time horizon of 3. As usual, it prescribes the optimal 
first action depending on the initial belief as to the tiger's location. The subsequent actions depend 
on the observations received. The observations include creaks that are indicative of the other agent's 
having opened a door. The creaks contain valuable information and allow the agent to make more 
refined choices, compared to ones in the noisy POMDP in Fig. 4. Consider the case when agent i 
starts out with fairly strong belief as to the tiger's location, decides to listen (according to the four 
off-center top row "L" nodes in Fig. 11) and hears a door creak. The agent is then in the position to 
open either the left or the right door, even if that is counter to its initial belief. The reason is that the 
creak is an indication that the tiger's position has likely been reset by agent j and that j will then 
not open any of the doors during the following two time steps. Now, two growls coming from the 
same door lead to enough confidence to open the other door. This is because the growls that agent i 
hears are indicative of the tiger's position in the state following the agents' actions. 

18. Note that I-POMDP solution is not as good as the solution of a POMDP for an agent operating alone in the environment 
shown in Fig. 3. 

Figure 11: The policy graph corresponding to the I-POMDP value function in Fig. 10. 

Note that the value functions and the policy above depict a special case of agent i having no 
information as to what probability j assigns to tiger's location (Fig. 5 (ii)). Accounting for and 
visualizing all possible beliefs i can have about j's beliefs is difficult due to the complexity of the 
space of interactive beliefs. As our ongoing work indicates, a drastic reduction in complexity is 
possible without loss of information, and consequently representation of solutions in a manageable 
number of dimensions is indeed possible. We will report these results separately. 

8. Conclusions 

We proposed a framework for optimal sequential decision-making suitable for controlling autonomous 
agents interacting with other agents within an uncertain environment. We used the normative 
paradigm of decision-theoretic planning under uncertainty formalized as partially observable Markov 
decision processes (POMDPs) as a point of departure. We extended POMDPs to cases of agents 
interacting with other agents by allowing them to have beliefs not only about the physical environment, 
but also about the other agents. This could include beliefs about the others' abilities, sensing 
capabilities, beliefs, preferences, and intended actions. Our framework shares numerous properties 
with POMDPs, has analogously defined solutions, and reduces to POMDPs when agents are alone 
in the environment. 

In contrast to some recent work on DEC-POMDPs (Bernstein et al., 2002; Nair et al., 2003), 
and to work motivated by game-theoretic equilibria (Boutilier, 1999; Hu & Wellman, 1998; Koller 
& Milch, 2001; Littman, 1994), our approach is subjective and amenable to agents independently 
computing their optimal solutions. 

The line of work presented here opens an area of future research on integrating frameworks for 
sequential planning with elements of game theory and Bayesian learning in interactive settings. In 
particular, one of the avenues of our future research centers on proving further formal properties of 
I-POMDPs, and establishing clearer relations between solutions to I-POMDPs and various flavors 
of equilibria. Another concentrates on developing efficient approximation techniques for solving 
I-POMDPs. As for POMDPs, development of approximate approaches to I-POMDPs is crucial for 
moving beyond toy problems. One promising approximation technique we are working on is particle 
filtering. We are also devising methods for representing I-POMDP solutions without assumptions 
about what's believed about other agents' beliefs. As we mentioned, in spite of the complexity of the 
interactive state space, there seem to be intuitive representations of belief partitions corresponding 
to optimal policies, analogous to those for POMDPs. Other research issues include the suitable 
choice of priors over models, and the ways to fulfill the absolute continuity condition needed for 
convergence of probabilities assigned to the alternative models during interactions (Kalai & Lehrer, 
1993). 

Appendix A. Proofs 

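Proof of Propositions 1 and 2. We start with Proposition 2, by applying the Bayes Theorem: 

$$
b_i^t(is^t) = Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1}) = \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t)\, Pr(is^t \mid a^{t-1}, is^{t-1})
\tag{13}
$$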



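19. We are looking at Kolmogorov complexity (Li & Vitanyi, 1997) as a possible way to assign priors. 

To simplify the term $Pr(is^t \mid a^{t-1}, is^{t-1})$ let us substitute the interactive state $is^t$ with its components. 
When $m_j$ in the interactive states is intentional: $is^t = (s^t, \theta_j^t) = (s^t, b_j^t, \hat{\theta}_j^t)$. 

$$
\begin{aligned}
Pr(is^t \mid a^{t-1}, is^{t-1}) &= Pr(s^t, b_j^t, \hat{\theta}_j^t \mid a^{t-1}, is^{t-1}) \\
&= Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1})\, Pr(s^t, \hat{\theta}_j^t \mid a^{t-1}, is^{t-1}) \\
&= Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1})\, Pr(\hat{\theta}_j^t \mid s^t, a^{t-1}, is^{t-1})\, Pr(s^t \mid a^{t-1}, is^{t-1}) \\
&= Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1})\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{14}
$$

When $m_j$ is subintentional: $is^t = (s^t, m_j^t) = (s^t, h_j^t, \hat{m}_j^t)$. 

$$
\begin{aligned}
Pr(is^t \mid a^{t-1}, is^{t-1}) &= Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1})\, Pr(s^t, \hat{m}_j^t \mid a^{t-1}, is^{t-1}) \\
&= Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1})\, Pr(\hat{m}_j^t \mid s^t, a^{t-1}, is^{t-1})\, Pr(s^t \mid a^{t-1}, is^{t-1}) \\
&= Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1})\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{14'}
$$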

The joint action pair, $a^{t-1}$, may change the physical state. The third term on the right-hand 
side of Eqs. 14 and 14' above captures this transition. We utilized the MNM assumption to replace 
the second terms of the equations with boolean identity functions, $I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)$ and $I(\hat{m}_j^{t-1}, \hat{m}_j^t)$ 
respectively, which equal 1 if the two frames are identical, and 0 otherwise. Let us turn our attention 
to the first terms. If $m_j$ in $is^t$ and $is^{t-1}$ is intentional: 

$$
\begin{aligned}
Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) &= \sum_{o_j^t} Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}, o_j^t)\, Pr(o_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}) \\
&= \sum_{o_j^t} Pr(b_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}, is^{t-1}, o_j^t)\, Pr(o_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1}) \\
&= \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)
\end{aligned}
\tag{15}
$$

Else if it is subintentional: 

$$
\begin{aligned}
Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) &= \sum_{o_j^t} Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}, o_j^t)\, Pr(o_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}) \\
&= \sum_{o_j^t} Pr(h_j^t \mid s^t, \hat{m}_j^t, a^{t-1}, is^{t-1}, o_j^t)\, Pr(o_j^t \mid s^t, \hat{m}_j^t, a^{t-1}) \\
&= \sum_{o_j^t} \delta_K(APPEND(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)
\end{aligned}
\tag{15'}
$$

In Eq. 15, the first term on the right-hand side is 1 if agent j's belief update, $SE_{\theta_j}(b_j^{t-1}, a_j^{t-1}, o_j^t)$, 
generates a belief state equal to $b_j^t$. Similarly, in Eq. 15', the first term is 1 if appending $o_j^t$ 
to $h_j^{t-1}$ results in $h_j^t$. $\delta_K$ is the Kronecker delta function. In the second terms on the right-hand 
side of the equations, the MNO assumption makes it possible to replace $Pr(o_j^t \mid s^t, \hat{\theta}_j^t, a^{t-1})$ with 
$O_j(s^t, a^{t-1}, o_j^t)$, and $Pr(o_j^t \mid s^t, \hat{m}_j^t, a^{t-1})$ with $O_j(s^t, a^{t-1}, o_j^t)$ respectively. 
Let us now substitute Eq. 15 into Eq. 14. 

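$$
Pr(is^t \mid a^{t-1}, is^{t-1}) = \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\tag{16}
$$

Substituting Eq. 15' into Eq. 14' we get, 

$$
\begin{aligned}
Pr(is^t \mid a^{t-1}, is^{t-1}) = {} & \sum_{o_j^t} \delta_K(APPEND(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t) \\
& \times T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{16'}
$$

Replacing Eq. 16 into Eq. 13 we get: 

$$
\begin{aligned}
b_i^t(is^t) = {} & \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{17}
$$

Similarly, replacing Eq. 16' into Eq. 13 we get: 

$$
\begin{aligned}
b_i^t(is^t) = {} & \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} \delta_K(APPEND(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
$$

We arrive at the final expressions for the belief update by removing the terms $I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)$ and 
$I(\hat{m}_j^{t-1}, \hat{m}_j^t)$ and changing the scope of the first summations. 
When $m_j$ in the interactive states is intentional: 

$$
\begin{aligned}
b_i^t(is^t) = {} & \beta \sum_{is^{t-1}:\, \hat{m}_j^{t-1} = \hat{\theta}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{18}
$$

Else, if it is subintentional: 

$$
\begin{aligned}
b_i^t(is^t) = {} & \beta \sum_{is^{t-1}:\, \hat{m}_j^{t-1} = \hat{m}_j^t} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} \delta_K(APPEND(h_j^{t-1}, o_j^t), h_j^t)\, O_j(s^t, a^{t-1}, o_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)
\end{aligned}
\tag{19}
$$

Since Proposition 2 expresses the belief $b_i^t(is^t)$ in terms of parameters of the previous time step 
only, Proposition 1 holds as well. □ 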

Before we present the proof of Theorem 1 we note that Equation 7, which defines value 
iteration in I-POMDPs, can be rewritten in the following form, $U^n = HU^{n-1}$. Here, $H : B \to B$ 
is a backup operator, and is defined as, 

$$
HU^{n-1}(\theta_i) = \max_{a_i \in A_i} h(\theta_i, a_i, U^{n-1})
$$

where $h : \Theta_i \times A_i \times B \to \mathbb{R}$ is, 

$$
h(\theta_i, a_i, U) = \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_i}(b_i, a_i, o_i), \hat{\theta}_i \rangle)
$$

and where B is the set of all bounded value functions U. Lemmas 1 and 2 establish important 
properties of the backup operator. Proof of Lemma 1 is given below, and proof of Lemma 2 follows 
thereafter. 



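Proof of Lemma 1. Select arbitrary value functions V and U such that $V(\theta_{i,l}) \le U(\theta_{i,l})$ $\forall \theta_{i,l} \in 
\Theta_{i,l}$. Let $\theta_{i,l}$ be an arbitrary type of agent i, and let $a_i^*$ denote the action attaining the maximum in $HV(\theta_{i,l})$. 

$$
\begin{aligned}
HV(\theta_{i,l}) &= \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&= \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i^*, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\le \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i^*, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\le \max_{a_i \in A_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&= HU(\theta_{i,l})
\end{aligned}
$$

Since $\theta_{i,l}$ is arbitrary, $HV \le HU$. □ 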

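Proof of Lemma 2. Assume two arbitrary well-defined value functions $V$ and $U$ such that $V \le U$.
From Lemma 1 it follows that $HV \le HU$. Let $\theta_{i,l}$ be an arbitrary type of agent $i$. Also, let $a_i^*$ be
the action that optimizes $HU(\theta_{i,l})$.

$$
\begin{aligned}
0 &\le HU(\theta_{i,l}) - HV(\theta_{i,l}) \\
&= \max_{a_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&\quad - \max_{a_i} \Big\{ \sum_{is} b_i(is)\, ER_i(is, a_i) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i, o_i), \hat{\theta}_i \rangle) \Big\} \\
&\le \sum_{is} b_i(is)\, ER_i(is, a_i^*) + \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&\quad - \sum_{is} b_i(is)\, ER_i(is, a_i^*) - \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \\
&= \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i) \Big[ U(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) - V(\langle SE_{\theta_{i,l}}(b_i, a_i^*, o_i), \hat{\theta}_i \rangle) \Big] \\
&\le \gamma \sum_{o_i \in \Omega_i} \Pr(o_i \mid a_i^*, b_i)\, \|U - V\| \\
&= \gamma \|U - V\|
\end{aligned}
$$

As the supremum norm is symmetrical, a similar result can be derived for $HV(\theta_{i,l}) - HU(\theta_{i,l})$.
Since $\theta_{i,l}$ is arbitrary, the contraction property follows, i.e. $\|HV - HU\| \le \gamma \|V - U\|$. □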
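
Lemmas 1 and 2 can be checked numerically on a finite stand-in for the backup operator. The sketch below is not the paper's operator: the type space of an I-POMDP is infinite, so we assume a finite set of type indices, a precomputed `successor` table in place of the belief update $SE$, and randomly generated rewards and observation probabilities. All names are hypothetical.

```python
import random

def backup(U, reward, obs_prob, successor, gamma):
    """Apply H once on a finite type set:
    HU[k] = max_a { reward[k][a] + gamma * sum_o Pr(o | a, k) * U[successor[k][a][o]] }."""
    HU = []
    for k in range(len(U)):
        values = []
        for a in range(len(reward[k])):
            future = sum(p * U[successor[k][a][o]]
                         for o, p in enumerate(obs_prob[k][a]))
            values.append(reward[k][a] + gamma * future)
        HU.append(max(values))
    return HU

def sup_norm(U, V):
    """Supremum norm of U - V on the finite type set."""
    return max(abs(u - v) for u, v in zip(U, V))

def random_model(n_types, n_actions, n_obs, rng):
    """Random rewards, observation distributions, and successor types."""
    reward = [[rng.uniform(-1, 1) for _ in range(n_actions)] for _ in range(n_types)]
    obs_prob, successor = [], []
    for _ in range(n_types):
        obs_row, succ_row = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_obs)]
            total = sum(weights)
            obs_row.append([w / total for w in weights])
            succ_row.append([rng.randrange(n_types) for _ in range(n_obs)])
        obs_prob.append(obs_row)
        successor.append(succ_row)
    return reward, obs_prob, successor

def check_lemmas(trials=200, gamma=0.9, seed=0):
    """Return True if monotonicity and contraction hold on every random trial."""
    rng = random.Random(seed)
    n_types = 6
    reward, obs_prob, successor = random_model(n_types, 3, 2, rng)
    for _ in range(trials):
        V = [rng.uniform(-5, 5) for _ in range(n_types)]
        U = [v + rng.uniform(0, 3) for v in V]  # guarantees V <= U pointwise
        HV = backup(V, reward, obs_prob, successor, gamma)
        HU = backup(U, reward, obs_prob, successor, gamma)
        monotone = all(hv <= hu + 1e-9 for hv, hu in zip(HV, HU))
        contracts = sup_norm(HV, HU) <= gamma * sup_norm(V, U) + 1e-9
        if not (monotone and contracts):
            return False
    return True
```

A passing check is only a sanity test on one random finite model; the proofs above are what establish the properties in general.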

Lemmas 1 and 2 provide the stepping stones for proving Theorem 1. The proof of Theorem 1 follows
from a straightforward application of the Contraction Mapping Theorem. We state the Contraction
Mapping Theorem (Stokey & Lucas, 1989) below:

Theorem 3 (Contraction Mapping Theorem). If $(S, \rho)$ is a complete metric space and $T : S \to S$
is a contraction mapping with modulus $\gamma$, then

1. $T$ has exactly one fixed point $U^*$ in $S$, and

2. The sequence $\{U^n\}$ converges to $U^*$.

Proof of Theorem 1 follows.
Proof of Theorem 1. The normed space $(B, \|\cdot\|)$ is complete w.r.t. the metric induced by the supremum
norm. Lemma 2 establishes the contraction property of the backup operator, $H$. Using Theorem 3,
and substituting $T$ with $H$, convergence of value iteration in I-POMDPs to a unique fixed
point is established. □
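
The contraction property also gives the usual error bounds for value iteration. The derivation below is a standard consequence of Lemma 2 and the triangle inequality rather than a statement taken from the text; it assumes $\gamma \in [0, 1)$, $U^n = HU^{n-1}$, and $U^* = HU^*$.

```latex
\begin{align*}
\|U^n - U^*\| &= \|HU^{n-1} - HU^*\| \le \gamma \|U^{n-1} - U^*\| \le \dots \le \gamma^n \|U^0 - U^*\|, \\
\|U^n - U^*\| &\le \gamma \|U^{n-1} - U^*\| \le \gamma \left( \|U^{n-1} - U^n\| + \|U^n - U^*\| \right) \\
\Longrightarrow \quad \|U^n - U^*\| &\le \frac{\gamma}{1 - \gamma}\, \|U^n - U^{n-1}\|.
\end{align*}
```

The first line shows that the error shrinks geometrically at rate $\gamma$; the last line turns the observable difference between successive iterates into a bound on the distance to the fixed point.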

We go on to the piecewise linearity and convexity (PWLC) property of the value function.
We follow the outlines of the analogous proof for POMDPs in (Hauskrecht, 1997; Smallwood &
Sondik, 1973).

Let $\alpha : IS \to \mathbb{R}$ be a real-valued and bounded function. Let the space of such real-valued
bounded functions be $B(IS)$. We will now define an inner product.

Definition 5 (Inner product). Define the inner product, $\langle \cdot, \cdot \rangle : B(IS) \times \Delta(IS) \to \mathbb{R}$, by

$$\langle \alpha, b_i \rangle = \sum_{is} b_i(is)\, \alpha(is)$$

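The next lemma establishes the bilinearity of the inner product defined above.

Lemma 3 (Bilinearity). For any $s, t \in \mathbb{R}$, $f, g \in B(IS)$, and $b, \lambda \in \Delta(IS)$ the following equalities
hold:

$$\langle sf + tg, b \rangle = s \langle f, b \rangle + t \langle g, b \rangle$$
$$\langle f, sb + t\lambda \rangle = s \langle f, b \rangle + t \langle f, \lambda \rangle$$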
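
Both equalities of Lemma 3 are easy to verify on a small finite example. The sketch below assumes a finite set of three interactive states and uses exact rational arithmetic so that the equalities hold without rounding; the state labels and numbers are hypothetical.

```python
from fractions import Fraction as F

def inner(alpha, b):
    """<alpha, b> = sum over interactive states of b(is) * alpha(is)."""
    return sum(b[k] * alpha[k] for k in b)

def combine(s, f, t, g):
    """Pointwise linear combination s*f + t*g of two functions on the same keys."""
    return {k: s * f[k] + t * g[k] for k in f}

IS = ["is1", "is2", "is3"]
f = dict(zip(IS, [F(1), F(-2), F(3)]))
g = dict(zip(IS, [F(0), F(5), F(-1)]))
b = dict(zip(IS, [F(1, 2), F(1, 4), F(1, 4)]))
lam = dict(zip(IS, [F(1, 10), F(3, 10), F(3, 5)]))
s, t = F(2, 3), F(1, 3)

# Linearity in the first argument.
left_first = inner(combine(s, f, t, g), b)
right_first = s * inner(f, b) + t * inner(g, b)

# Linearity in the second argument.
left_second = inner(f, combine(s, b, t, lam))
right_second = s * inner(f, b) + t * inner(f, lam)
```

With $s + t = 1$ and $s, t \ge 0$, as chosen here, the combination $sb + t\lambda$ is itself a belief in $\Delta(IS)$.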



We are now ready to give the proof of Theorem 2. Theorem 4 restates Theorem 2 mathematically,
and its proof follows thereafter.

Theorem 4 (PWLC). The value function, $U^n$, in a finitely nested I-POMDP is piece-wise linear and
convex (PWLC). Mathematically,

$$U^n(\theta_{i,l}) = \max_{\alpha^n} \sum_{is} b_i(is)\, \alpha^n(is) \qquad n = 1, 2, \ldots$$



Proof of Theorem 4. Basis Step: $n = 1$

From Bellman's Dynamic Programming equation,

$$U^1(\theta_i) = \max_{a_i} \sum_{is} b_i(is)\, ER_i(is, a_i) \qquad (20)$$

where $ER_i(is, a_i) = \sum_{a_j} R(is, a_i, a_j) \Pr(a_j \mid m_j)$. Here, $ER_i(\cdot)$ represents the expectation of
$R$ w.r.t. agent $j$'s actions. Eq. 20 represents an inner product and, using Lemma 3, the inner product
is linear in $b_i$. By selecting the maximum of a set of linear vectors (hyperplanes), we obtain a PWLC
horizon 1 value function.
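
The basis step can be illustrated with a small sketch: each action of agent i yields one alpha vector whose entries are the expected rewards $ER_i(is, a_i)$, and the horizon 1 value is the maximum of the corresponding inner products. The example assumes interactive states of the form (physical state, model of j), a model that is just a fixed action distribution, and made-up rewards; none of these numbers come from the paper.

```python
from fractions import Fraction as F

def expected_reward(R, policy_j, s, m_j, a_i):
    """ER_i(is, a_i) = sum over a_j of R(s, a_i, a_j) * Pr(a_j | m_j), with is = (s, m_j)."""
    return sum(p * R[(s, a_i, a_j)] for a_j, p in policy_j[m_j].items())

def horizon1_alpha_vectors(R, policy_j, interactive_states, actions_i):
    """One alpha vector per action of agent i."""
    return {a_i: {(s, m_j): expected_reward(R, policy_j, s, m_j, a_i)
                  for (s, m_j) in interactive_states}
            for a_i in actions_i}

def value(belief, alphas):
    """U^1(b) = max over alpha vectors of <alpha, b>."""
    return max(sum(p * alpha[state] for state, p in belief.items())
               for alpha in alphas.values())

def mix(weight, b1, b2):
    """Convex combination weight*b1 + (1 - weight)*b2 of two beliefs."""
    keys = set(b1) | set(b2)
    return {k: weight * b1.get(k, 0) + (1 - weight) * b2.get(k, 0) for k in keys}

# Hypothetical rewards R(s, a_i, a_j) and two models of agent j.
R = {("s0", "x", "u"): F(4), ("s0", "x", "v"): F(0),
     ("s1", "x", "u"): F(-2), ("s1", "x", "v"): F(-2),
     ("s0", "y", "u"): F(-1), ("s0", "y", "v"): F(-1),
     ("s1", "y", "u"): F(0), ("s1", "y", "v"): F(6)}
policy_j = {"m_a": {"u": F(1, 2), "v": F(1, 2)},
            "m_b": {"u": F(1), "v": F(0)}}
interactive_states = [(s, m) for s in ("s0", "s1") for m in ("m_a", "m_b")]
alphas = horizon1_alpha_vectors(R, policy_j, interactive_states, ["x", "y"])

b1 = {("s0", "m_a"): F(1)}
b2 = {("s1", "m_a"): F(1)}
b_mix = mix(F(1, 2), b1, b2)
```

Here the value at the mixed belief lies below the average of the values at the two endpoints, which is the convexity that the maximum over hyperplanes guarantees.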

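Inductive Hypothesis: Suppose that $U^{n-1}(\theta_{i,l})$ is PWLC. Formally we have,

$$
\begin{aligned}
U^{n-1}(\theta_{i,l}) &= \max_{\alpha^{n-1}} \sum_{is} b_i(is)\, \alpha^{n-1}(is) \qquad (21) \\
&= \max_{\alpha^{n-1}} \Big\{ \sum_{is:\, m_j \in IM_j} b_i(is)\, \alpha^{n-1}(is) + \sum_{is:\, m_j \in SM_j} b_i(is)\, \alpha^{n-1}(is) \Big\}
\end{aligned}
$$

Inductive Proof: To show that $U^n(\theta_{i,l})$ is PWLC.

$$U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1})\, U^{n-1}(\langle SE_{\theta_{i,l}}(b_i^{t-1}, a_i^{t-1}, o_i^t), \hat{\theta}_i \rangle) \Big\}$$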



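From the inductive hypothesis:

$$U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \max_{\alpha^{n-1}} \sum_{is^t} b_i^t(is^t)\, \alpha^{n-1}(is^t) \Big\}$$

Let $l(b_i^{t-1}, a_i^{t-1}, o_i^t)$ be the index of the alpha vector that maximizes the value at $b_i^t = SE_{\theta_{i,l}}(b_i^{t-1}, a_i^{t-1}, o_i^t)$.
Then,

$$U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \sum_{is^t} b_i^t(is^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\}$$

From the second equation in the inductive hypothesis:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ & \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \\
& \times \Big[ \sum_{is^t:\, m_j^t \in IM_j} b_i^t(is^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) + \sum_{is^t:\, m_j^t \in SM_j} b_i^t(is^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big] \Big\}
\end{aligned}
$$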
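Substituting $b_i^t$ with the appropriate belief updates from Eqs. 17 and 17' we get:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ & \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1}) \times \beta \\
& \times \Big[ \sum_{is^t:\, m_j^t \in IM_j} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big\}\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \\
& + \sum_{is^t:\, m_j^t \in SM_j} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\mathrm{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big\}\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big] \Big\}
\end{aligned}
$$

and further, because the normalizing constant $\beta$ equals $1 / \Pr(o_i^t \mid a_i^{t-1}, b_i^{t-1})$,

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ & \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, ER_i(is^{t-1}, a_i^{t-1}) \\
& + \gamma \sum_{o_i^t} \Big[ \sum_{is^t:\, m_j^t \in IM_j} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big\}\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \\
& + \sum_{is^t:\, m_j^t \in SM_j} \Big\{ \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\mathrm{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t) \Big\}\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big] \Big\}
\end{aligned}
$$

Rearranging the terms of the equation:

$$
\begin{aligned}
U^n(\theta_{i,l}) = \max_{a_i^{t-1}} \Big\{ & \sum_{is^{t-1}:\, m_j^{t-1} \in IM_j} b_i^{t-1}(is^{t-1}) \Big\{ ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t:\, m_j^t \in IM_j} \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\} \\
& + \sum_{is^{t-1}:\, m_j^{t-1} \in SM_j} b_i^{t-1}(is^{t-1}) \Big\{ ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t:\, m_j^t \in SM_j} \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \qquad \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\mathrm{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t) \Big\} \Big\}
\end{aligned}
$$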

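Therefore,

$$U^n(\theta_{i,l}) = \max_{\alpha^n} \sum_{is^{t-1}} b_i^{t-1}(is^{t-1})\, \alpha^n(is^{t-1}) = \max_{\alpha^n} \langle \alpha^n, b_i^{t-1} \rangle \qquad (22)$$

where, if $m_j^{t-1}$ in $is^{t-1}$ is intentional, then $\alpha^n$ is given by:

$$
\begin{aligned}
\alpha^n(is^{t-1}) = {} & ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t:\, m_j^t \in IM_j} \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid \theta_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau_{\theta_j^t}(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)\, I(\hat{\theta}_j^{t-1}, \hat{\theta}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t)
\end{aligned}
$$

and, if $m_j^{t-1}$ is subintentional, then $\alpha^n$ is given by:

$$
\begin{aligned}
\alpha^n(is^{t-1}) = {} & ER_i(is^{t-1}, a_i^{t-1}) + \gamma \sum_{o_i^t} \sum_{is^t:\, m_j^t \in SM_j} \sum_{a_j^{t-1}} \Pr(a_j^{t-1} \mid m_j^{t-1})\, O_i(s^t, a^{t-1}, o_i^t) \\
& \times \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \delta_K(\mathrm{APPEND}(h_j^{t-1}, o_j^t), h_j^t)\, I(\hat{m}_j^{t-1}, \hat{m}_j^t)\, T_i(s^{t-1}, a^{t-1}, s^t)\, \alpha^{n-1}_{l(b_i^{t-1}, a_i^{t-1}, o_i^t)}(is^t)
\end{aligned}
$$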

Eq. 22 is an inner product and, using Lemma 3, the value function is linear in $b_i^{t-1}$. Furthermore,
maximizing over a set of linear vectors (hyperplanes) produces a piecewise linear and convex value
function. □